A semantic union model for open domain Chinese knowledge base question answering

In Open-domain Chinese Knowledge Base Question Answering (ODCKBQA), most common simple questions can be answered by a single relational fact in the knowledge base (KB). However, the abbreviations, aliases, and nesting of entities in Chinese questions, together with the gap between natural language and the structured semantics of the KB, make it difficult for a system to return accurate answers. This study proposes a semantic union model (SUM), which concatenates candidate entities and candidate relations, uses a contrastive learning algorithm to learn semantic vector representations of the question and the candidate entity-relation pairs, and computes their cosine similarity to complete entity disambiguation and relation matching simultaneously. Because the relations connected to an entity provide information for disambiguating it, the model avoids error propagation and improves system performance. Experimental results show that the system achieves an average F1 of 85.94% on the dataset provided by the NLPCC-ICCPOL 2016 KBQA task.

In semantic parsing (SP) methods, the goal is to convert natural language (NL) questions into equivalent logical expressions according to a specific grammar, execute the resulting query against the KB, and obtain the answers 9-11. Since an open-domain Chinese KB contains hundreds of thousands of relations, SP methods face the problem of unregistered relation words: the training set can hardly cover such a large number of relations, which limits the applicability of SP methods in ODCKBQA.
Information retrieval (IR) methods first locate the entities in the question, map them to the KB, obtain all connected relations and attribute-value entities, and then get the answers by calculating the similarity between the question and these candidates 2,12-20. Bordes et al. 2 proposed a vector embedding-based method that encodes questions and answers, calculates the semantic similarity between the two, and ranks the answers. Li et al. 12 designed a multi-column convolutional neural network to capture the interactive information between questions and answers. Xie et al. 14 applied Deep Structured Semantic Models (DSSM) 15 based on a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) 16 to calculate the similarity between the question and relations. Lai et al. 17 used the word frequency and length features of entities to find the entity mention in the question and its corresponding entity in the KB, and then matched the corresponding relation based on word2vec embedding cosine similarity and relation-word attention. Later, a shallow method based on features and word embeddings was proposed to generate candidate entities and relations, and deep CNNs were then used to rerank these entity-relation pairs 18.
With the development of pre-trained language models, some studies 6,19-21 have used them to build ODCKBQA systems. Liu et al. 6 used the pre-trained language model BERT 22 to learn semantic representations of questions and candidate words. Li et al. 19 combined the loss functions of the entity mention recognition task and the relation matching task for joint modeling, trained a BERT model with shared parameters, and used the output of KB entities, fuzzy text matching, and n-gram information to complete entity linking, but they did not fully consider the ambiguity of entities with the same name but different semantics. Lin et al. 20 used unsupervised training and fine-tuning to teach the MT5 model to convert answer sentences constructed from triples into questions, and used the Roformer model to judge whether candidate sentences and the question are similar. Their method uses fuzzy matching to retrieve candidate entities and their triples from the KB based on the entity mention in the question; it ignores ambiguous entities with the same name and does not consider their impact when evaluating the overall system. None of the models above effectively addresses the impact of same-name entity ambiguity on ODCKBQA, and none can accurately distinguish questions and candidate words that are textually similar but semantically very different. Thus, this study uses the CoSENT model to learn more discriminative semantic vector representations of questions and candidate entity-relation pairs. Furthermore, we integrate the entity disambiguation and relation matching tasks into a unified SUM framework.
Figure 1 shows the overall ODCKBQA framework, which comprises three subtasks: entity mention recognition, entity disambiguation, and relation matching. The mention2id dictionary provides candidate entities for the entity disambiguation task.
The entity mention recognition module identifies the subject entity mention in the input NL question. However, an entity mention in an NL question often has multiple meanings, and entity disambiguation must find the exact corresponding entity in the KB. Moreover, the relations in a question usually have varied surface forms and are not easy to match against the relations in the KB. This mismatch between NL questions and the structured semantics of the KB is a key challenge in ODCKBQA. We use the CoSENT model to learn deeper semantic features and distinguish these semantic differences. Finally, the answer extraction module extracts answers from the KB with query statements built from the entity and relation obtained previously. Traditional methods treat entity disambiguation and predicate matching as independent subtasks, ignoring their dependencies. Intuitively, candidate entities connected by similar predicates offer more information for entity disambiguation, and vice versa. When the two are treated as independent tasks, errors propagate and degrade the overall system performance. Thus, we propose a SUM that combines entity disambiguation and relation matching in a unified framework, taking full account of the correlation between the two tasks.
www.nature.com/scientificreports/

Models and methods
Base model. This section describes the BERT and CoSENT models used in this article.
BERT. Figure 2 shows the structure of the BERT model. The model input vector consists of three parts: Token Embeddings, Segment Embeddings, and Position Embeddings. BERT adds a special [CLS] tag before the input sentence sequence, and the output vector corresponding to this tag is used as the semantic representation of the entire input sequence, usually for classification tasks. The model also adds a special [SEP] tag after the sentence sequence for sentence segmentation. We input the sequence representation $q_1, q_2, \cdots, q_N$ of the question $Q$ into the BERT model to get the vector representation of each word in the sentence:

$$H_q = \mathrm{BERT}(q_1, q_2, \cdots, q_N) \tag{1}$$

where $H_q = \{h_{cls}, h_1, h_2, \cdots, h_N, h_{sep}\}$, $N$ is the length of the input sequence $Q$, and $h_i$ is the output vector of the BERT layer corresponding to the $i$-th word.
CoSENT model. The structure of the CoSENT model is similar to that of Sentence-BERT 23: it uses two parameter-shared BERTs to form a Siamese neural network. The CoSENT model encodes the input sentences U and V, pools the outputs to derive fixed-size sentence embeddings $u$ and $v$, and uses the cosine similarity function to calculate their similarity, as shown in Eq. (2):

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} \tag{2}$$

In the training phase of the CoSENT model, $h^+$ is the set of all positive sample pairs, and $h^-$ is the set of all negative sample pairs. For any positive sample pair $(h_i, h_j) \in h^+$ and negative sample pair $(h_k, h_l) \in h^-$, we require the following:

$$\cos(u_i, u_j) > \cos(u_k, u_l) \tag{3}$$

where $u_i, u_j, u_k, u_l$ are the sentence vectors of $h_i, h_j, h_k, h_l$, respectively. The loss function of the CoSENT model is shown in Eq. (4):

$$\mathcal{L} = \log\Bigl(1 + \sum_{(h_i, h_j) \in h^+,\ (h_k, h_l) \in h^-} e^{\lambda\left(\cos(u_k, u_l) - \cos(u_i, u_j)\right)}\Bigr) \tag{4}$$

where $\lambda$ is a hyperparameter greater than 0, set to 15 in the subsequent experiments. The loss function pulls the representations of semantically similar sentence pairs together in the vector space and pushes dissimilar sentence pairs apart during training, yielding more discriminative sentence vector representations.

Figure 3 shows the SUM framework, which uses CoSENT to learn semantic vector representations of questions and entity-relation pairs in order to match the entities and relations of candidate fact triples, considering deeper semantic features. Note that the question and the entity-relation pairs use a BERT model with shared parameters to output semantic vectors.
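The CoSENT objective of Eq. (4) can be illustrated with a minimal sketch in plain Python. It assumes the sentence vectors are already available as lists of floats; the function names `cosine` and `cosent_loss` are hypothetical and are not taken from the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosent_loss(pos_pairs, neg_pairs, lam=15.0):
    """log(1 + sum exp(lam * (cos(neg) - cos(pos)))) over every
    combination of one positive pair and one negative pair."""
    pos_sims = [cosine(u, v) for u, v in pos_pairs]
    neg_sims = [cosine(u, v) for u, v in neg_pairs]
    total = sum(math.exp(lam * (n - p)) for p in pos_sims for n in neg_sims)
    return math.log1p(total)

# Toy vectors (assumed for illustration): identical vectors form a
# similar pair, orthogonal vectors form a dissimilar pair.
similar = ([1.0, 0.0], [1.0, 0.0])
dissimilar = ([1.0, 0.0], [0.0, 1.0])

well_ordered = cosent_loss([similar], [dissimilar])   # positives score higher
badly_ordered = cosent_loss([dissimilar], [similar])  # positives score lower
```

The loss is close to zero when every positive pair has a higher cosine similarity than every negative pair, and grows roughly linearly in the violation otherwise, which is the ordering constraint stated in Eq. (3).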

Semantic union model.
First, we connect each candidate entity $e_i$ in the candidate entity set $E = \{e_1, e_2, \cdots, e_l\}$ with each predicate in its connected predicate set $R_i = \{r^i_1, r^i_2, \cdots, r^i_n\}$ through a special [AND] identifier to form the set of candidate entity-relation pairs $C = \{e_1 r^1_1, \cdots, e_1 r^1_n, e_2 r^2_1, \cdots, e_l r^l_n\}$. Second, the question $Q$ and the candidate entity-relation pair set $C$ are input into the BERT layer to obtain their vector representations. Then, these vectors are fed into pooling layers separately to obtain fixed-size sentence embeddings, expressed as the following:

$$u = \mathrm{Pooling}(\mathrm{BERT}(Q)) \tag{5}$$

$$v = \mathrm{Pooling}(\mathrm{BERT}(C)) \tag{6}$$

The pooling layer uses the average pooling strategy by default, and the cosine similarity function is used to calculate their similarity:

$$\mathrm{sim\_s} = \cos(u, v) \tag{7}$$

where $\mathrm{sim\_s}$ is the set of similarity scores between the question and the candidate entity-relation pairs, and the objective loss function that we minimize is the same as Eq. (4).
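The pair construction and ranking described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `toy_embed` is a hypothetical character-count stand-in for the shared BERT encoder with average pooling, and the exact spacing around the [AND] identifier is assumed.

```python
import math
from collections import Counter

def build_pairs(candidates):
    """candidates maps each candidate entity to its connected relations.
    Returns (entity, relation, text) with the [AND] identifier in between."""
    return [(e, r, f"{e}[AND]{r}") for e, rels in candidates.items() for r in rels]

def toy_embed(text):
    """Hypothetical stand-in for BERT + average pooling: character counts."""
    return Counter(ch for ch in text.lower() if ch.isalnum())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_pairs(question, candidates, embed=toy_embed):
    """Score every entity-relation pair against the question, best first."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(text)), e, r)
              for e, r, text in build_pairs(candidates)]
    return sorted(scored, reverse=True)

candidates = {
    "Dream of Red Mansions (novel)": ["number of pages", "author"],
    "Dream of Red Mansions (movie)": ["director"],
}
ranked = rank_pairs("How many pages does Dream of Red Mansions have?", candidates)
```

Because each candidate is scored as a whole entity-relation string, a single ranking step selects the entity and the relation together; swapping `toy_embed` for a trained sentence encoder would leave the rest of the sketch unchanged.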
Intuitively, the candidate relations provide semantic information for entity disambiguation. If we know the relation expressed in the question, we can exclude some candidate entities through its semantic information. For example, the question "How many pages does Dream of Red Mansions have?" contains the phrase "how many pages," which corresponds to the relation word "number of pages." For entity disambiguation, it is therefore reasonable to focus on candidate entities connected with "number of pages," such as "Dream of Red Mansions (novel)" rather than "Dream of Red Mansions (movie)." Therefore, we constructed a SUM to perform entity disambiguation and relation matching jointly.

Entity mention recognition.
We used the standard BIO tagging strategy to label each word in the question. The entity mention recognition task is to identify the subject entity mention. We constructed a BERT-BiLSTM-CRF model, consisting of a BERT layer, a BiLSTM layer, and a CRF layer, with the question $Q$ as the input sequence; the BERT layer structure is the same as that shown in Fig. 2. We input the sequence representation $q_1, q_2, \cdots, q_N$ of the question $Q$ into the BERT-BiLSTM-CRF model to obtain the label probability distribution of each word in the sentence:

$$Y = \text{BERT-BiLSTM-CRF}(q_1, q_2, \cdots, q_N) \tag{8}$$

where $Y$ is the label probability distribution predicted by the model. We chose the label with the highest probability as the label of each word and took the span labeled B and I as the entity mention output.
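The final decoding step, turning predicted BIO labels into an entity mention, can be sketched as below. The helper `extract_mention` is hypothetical, and the Chinese example question and its labels are assumed for illustration; the sketch returns only the first B/I span, in line with the single subject entity described above.

```python
def extract_mention(tokens, labels):
    """Return the first contiguous B/I span as the subject entity mention.

    tokens and labels are parallel lists; labels use the tags B, I, and O.
    Chinese characters are joined without separators.
    """
    span = []
    for token, label in zip(tokens, labels):
        if label == "B":
            if span:          # a second mention starts: keep only the first
                break
            span = [token]
        elif label == "I" and span:
            span.append(token)
        elif span:            # an O tag closes the current span
            break
    return "".join(span)

tokens = list("红楼梦有多少页")
labels = ["B", "I", "I", "O", "O", "O", "O"]
mention = extract_mention(tokens, labels)
```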
Entity disambiguation and relation matching. Since the entity mention in the question may correspond to multiple entities with different meanings in the KB, entity disambiguation maps the entity mention in the question to a known unambiguous entity in the KB. Given the question $Q = \{x_1, x_2, \cdots, x_n\}$ and the candidate entity set $E = \{e_1, e_2, \cdots, e_l\}$, we use the CoSENT model to calculate the similarity between them and rank the candidates, as shown in Eq. (9):

$$P_e = \mathrm{CoSENT}(Q, e_i) \tag{9}$$

where $P_e$ is the semantic similarity score between the question and the candidate entity. Most entities in the KB are connected with multiple relations. The relation-matching task scores each candidate relation of the entity according to its semantic similarity to the question, in order to identify the relation word that best matches the semantics of the question. After the entity disambiguation task, we obtain all relations connected to the selected entity from the KB, which form a candidate relation set $R = \{r_1, r_2, \cdots, r_n\}$, where $n$ is the number of candidate relations. We used the CoSENT model to obtain the semantic similarity score between the question $Q$ and the candidate relation $r_i$, as shown in Eq. (10):

$$P_r = \mathrm{CoSENT}(Q, r_i) \tag{10}$$

where $P_r$ is the semantic similarity score between the question $Q$ and the candidate relations in $R$.
The above process executes the entity disambiguation task and the relation matching task sequentially, which leads to error propagation. If the entity selected by the entity disambiguation model deviates from the question, the relation-matching model will fail to find the correct relation and thus cannot find the correct answer in the KB. Moreover, the information from the relation-matching stage cannot be used in the entity disambiguation process. For example, a candidate entity that is not connected to the correct relation may still be selected in the entity disambiguation task, eventually leading to a wrong result. Thus, we proposed SUM to complete entity disambiguation and predicate matching as a joint task by calculating the semantic similarity between the question and the candidate entity-relation pairs.
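The difference between the pipeline and the joint selection can be made concrete with a toy example. All scores below are invented for illustration and are not experimental results; `pipeline_select` and `joint_select` are hypothetical helper names, and the joint scores simply reuse the toy relation scores.

```python
def pipeline_select(entity_scores, relation_scores):
    """Pick the best entity first, then the best relation of that entity only."""
    best_entity = max(entity_scores, key=entity_scores.get)
    relations = relation_scores[best_entity]
    best_relation = max(relations, key=relations.get)
    return best_entity, best_relation

def joint_select(pair_scores):
    """Pick the best (entity, relation) pair in a single step."""
    return max(pair_scores, key=pair_scores.get)

# Invented scores: the movie entity narrowly wins when entities are
# scored alone, although it has no relation about pages.
entity_scores = {
    "Dream of Red Mansions (movie)": 0.62,
    "Dream of Red Mansions (novel)": 0.60,
}
relation_scores = {
    "Dream of Red Mansions (movie)": {"director": 0.20, "release date": 0.25},
    "Dream of Red Mansions (novel)": {"number of pages": 0.91, "author": 0.30},
}
pair_scores = {(e, r): s
               for e, rels in relation_scores.items()
               for r, s in rels.items()}

pipeline_result = pipeline_select(entity_scores, relation_scores)
joint_result = joint_select(pair_scores)
```

In this toy setting the pipeline commits to the movie entity and can only return one of its relations, whereas the joint selection reaches the novel entity through its strongly matching relation.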
We performed fuzzy matching in the Neo4j graph database using the entity mention in the question to obtain candidate entity-relation pairs. Then, we used the mention2id dictionary to filter them, retaining only the candidate entities, and their relations, that the dictionary lists for the entity mention. The retained pairs form the set $C$ of candidate entity-relation pairs. With the SUM, we calculated the semantic similarity between the question and the candidate entity-relation pair set and selected the top N candidate entity-relation pairs:

$$P_{er} = \mathrm{SUM}(Q, C) \tag{11}$$

where $P_{er}$ is the semantic similarity score of $Q$ and set $C$. We selected the candidate entity-relation pair with the highest score and obtained the corresponding answer from the Neo4j graph database through a CQL query statement.
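The filtering and top-N selection steps can be sketched as follows. The sketch assumes the fuzzy-matching hits are already available as (entity, relation) tuples and that mention2id is a plain dictionary from a mention to a list of entity names; the Neo4j query itself is omitted, and the helper names, data, and scores are illustrative assumptions.

```python
def filter_by_mention2id(mention, fuzzy_hits, mention2id):
    """Keep only the hits whose entity is listed for the mention."""
    allowed = set(mention2id.get(mention, []))
    return [(e, r) for e, r in fuzzy_hits if e in allowed]

def top_n(scored_pairs, n):
    """scored_pairs holds (score, entity, relation); return the n best."""
    return sorted(scored_pairs, key=lambda item: item[0], reverse=True)[:n]

# Illustrative data, not taken from the NLPCC-ICCPOL 2016 KB.
mention2id = {"红楼梦": ["红楼梦(小说)", "红楼梦(电影)"]}
fuzzy_hits = [
    ("红楼梦(小说)", "页数"),
    ("红楼梦(电影)", "导演"),
    ("红楼梦学刊", "主编"),   # fuzzy match that the dictionary does not list
]
kept = filter_by_mention2id("红楼梦", fuzzy_hits, mention2id)

# Invented similarity scores standing in for the model output.
scored = [(0.91, "红楼梦(小说)", "页数"), (0.25, "红楼梦(电影)", "导演")]
best = top_n(scored, 1)[0]
```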

Experiment
We first describe the KB, datasets, parameter settings, and evaluation metrics. Then, we present the experimental results and analysis.

Experimental setup. Knowledge base introduction.
We gathered our dataset from the NLPCC ICCPOL 2016 KBQA datasets, which contain a training set of 14,609 question-answer pairs and a test set of 9,870 question-answer pairs. The dataset provides a KB and a mention2id entity ambiguity dictionary. The KB contains 6,502,738 entities, 587,875 relations, and 43,063,796 triples and is stored as a text file in which each line holds one triple (entity, relationship, entity). The mention2id dictionary includes 7,623,034 entity-entity pairs. The content of the KB is shown in Table 1.
Datasets. The experiments are based on the NLPCC ICCPOL 2016 KBQA datasets, which comprise entity mentions, relations, and answers to questions. For the entity mention recognition task, we labeled the entity mentions in the questions using the BIO notation, based on the entity mentions provided by the original dataset. For the entity disambiguation task, we obtained the candidate entities according to the mention2id dictionary and the mention in the question. We also queried the corresponding entity in the KB through the answer and relation of the question, marked it as a positive example, and marked the other candidate entities as negative examples. For the relation-matching task, we fetched all the relations connected to the correct entity from the KB and labeled the correct relation as a positive example and the other relations as negative examples. For the joint task of entity disambiguation and relation matching, we performed fuzzy matching in the Neo4j graph database based on mentions to obtain candidate entity-relation pairs, which were filtered using the mention2id dictionary. Table 2 shows the final datasets of each subtask.
Parameters. We used the Chinese BERT base model to initialize the weights. For all models, we set the maximum sequence length to 64, the batch size to 32, and the number of epochs to 20. We minimized the loss function using Adam with a learning rate of 2e-5 and set the hyperparameter of Eq. (4) to 15.
Evaluation metrics. We used AverageF1 to evaluate the KBQA system performance. AverageF1 is defined as the following:

$$\mathrm{AverageF1} = \frac{1}{m} \sum_{i=1}^{m} F_i \tag{12}$$

where $m$ is the number of questions and $F_i$ represents the F1 score for a question $Q_i$; $F_i$ is set to 0 if the generated answer set $C_i$ for $Q_i$ is empty or does not overlap the golden answers $A_i$ for $Q_i$. Otherwise, $F_i$ is formulated as follows:

$$F_i = \frac{2 \cdot \frac{\#(C_i, A_i)}{|C_i|} \cdot \frac{\#(C_i, A_i)}{|A_i|}}{\frac{\#(C_i, A_i)}{|C_i|} + \frac{\#(C_i, A_i)}{|A_i|}} \tag{13}$$

where $\#(C_i, A_i)$ represents the number of answers that appear in both $C_i$ and $A_i$; $|C_i|$ and $|A_i|$ denote the number of answers in $C_i$ and $A_i$, respectively. Accuracy@N is the proportion of questions for which the N highest-scoring candidates contain the correct result.
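The metric can be written out as a short sketch that follows the definition above, treating the generated and golden answers of each question as sets. The function names `question_f1` and `average_f1` are hypothetical.

```python
def question_f1(predicted, gold):
    """F1 for one question; 0 if the answers are empty or do not overlap."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def average_f1(predictions, golds):
    """Mean of the per-question F1 scores over all questions."""
    scores = [question_f1(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)

# Three toy questions: one exact answer, one empty answer, one wrong answer.
predictions = [["a"], [], ["x"]]
golds = [["a"], ["b"], ["y"]]
score = average_f1(predictions, golds)
```

For a question with two generated answers of which one is the single golden answer, precision is 0.5 and recall is 1, so the per-question F1 is 2/3.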
Experimental results and analysis. For the entity mention recognition module, we used the BERT-BiLSTM-CRF model to identify entity mentions in questions. The model achieved an entity-level accuracy of 97.41%, and 98.05% after adding manual rules. The results of the remaining experiments are analyzed as follows.
(1) As shown in Table 3, the CoSENT model is superior to the other models in the entity disambiguation task, as it captures deeper semantic information. In the training stage, the CoSENT model directly optimizes the cosine value of two sentences to obtain more discriminative semantic information. Compared with the CoSENT model, the BERT model and the Sentence-BERT model, which are trained as classification models, show performance drops of 0.73% and 3.04%, respectively. The CoSENT model also extracts semantic information better than the Siamese BiLSTM and Siamese CNN models built on traditional neural networks.
(2) Table 4 presents the experimental results of the relation-matching task. Since the entity mention in the question may interfere with model learning, we conducted a set of experiments comparing questions with and without masking of the entity mention. The results show that masking the entity mention in the questions improves the models, and that the interactive model BERT-Softmax(mask) performs best, slightly better than the representation model CoSENT(mask). Among the representation models, CoSENT, which is based on contrastive learning, outperforms the Siamese BiLSTM and Sentence-BERT models, which demonstrates the advantage of the contrastive learning loss.
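The mask operation on the entity mention can be sketched as a simple string replacement. The text does not state which placeholder the authors substitute, so the `[MASK]` token, the helper name `mask_mention`, and the example question are assumptions made for illustration.

```python
def mask_mention(question, mention, mask_token="[MASK]"):
    """Replace the first occurrence of the entity mention with a placeholder."""
    if not mention:
        return question
    return question.replace(mention, mask_token, 1)

masked = mask_mention("红楼梦有多少页", "红楼梦")
```

After masking, the relation-matching model sees only the part of the question that expresses the relation, which is the motivation given above for the experiment.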
(3) As shown in Table 5, the experimental results of the joint entity disambiguation and relation matching task show that masking the entity mention in the question and in the candidate entity-relation pairs has a certain effect. The effect of the CoSENT model is 0.12% higher than that of the BERT model, 2.07% higher than
(4) We also performed experiments on the NLPCC ICCPOL 2016 KBQA datasets; the evaluation index used in the final results of the official evaluation is the average F1 value. The overall system uses the BERT-BiLSTM-CRF model in the entity mention recognition module and applies the mask operation in the relation matching and joint task models. The final overall KBQA results are shown in Table 6. The results show that the SUM, which performs entity disambiguation and relation matching as a joint task, has advantages over the pipeline approach in ODCKBQA. Table 7 compares the results of all systems 6,14,17-20,24,25 that participate in the NLPCC-ICCPOL 2016 KBQA evaluation task. The average F1 score of our proposed SUM is 85.94%, which is superior to other pipeline models that use many hand-crafted feature rules 14,17, LSTM or CNN 24,25, and BERT 6. In paper 18, Lai et al. did not consider sentences with defective entities and instead selected 9782 of the 9870 test questions for their experiments, resulting in a relatively high average F1 score. Papers 19 and 20 achieved high results because they did not consider the ambiguity of entities with the same name and only used fuzzy matching and similar methods to find relevant entities in the KB, whereas we fully considered the entity disambiguation task.

Conclusion
We proposed a SUM to construct an ODCKBQA system. The proposed SUM fully considers the impact of ambiguity between entities with the same name, combines the entity disambiguation and relation matching tasks within a unified framework, and uses a CoSENT model based on contrastive learning to learn deeper and more discriminative semantic vector representations. Experimental results on the NLPCC ICCPOL 2016 KBQA datasets demonstrate the advantages of the proposed SUM.